House Prices

Ahmet Yetkin Eser (DatAddict)

Data Source and Aim of Study:

The data taken by kaggle which is community of data scientist and data enthusiasts.

I choose this data because there are a lot of good kernels about this data in Kaggle. Thanks to that, i will have the chance to learn from many people.

1. Step : About Dataset

Download Data and Packages

In [1]:
import pandas as pd
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

train = pd.read_csv("/Users/yetkineser/Desktop/mef Python/final project/data/train.csv")

test = pd.read_csv("/Users/yetkineser/Desktop/mef Python/final project/data/test.csv")

# train = pd.read_csv("C:/Users/A46988/Desktop/housePrice/train.csv")

# test = pd.read_csv("C:/Users/A46988/Desktop/housePrice/test.csv")
In [2]:
# Show first five row for train dataset
train.head()
Out[2]:
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
0 1 60 RL 65.0 8450 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 2 2008 WD Normal 208500
1 2 20 RL 80.0 9600 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 5 2007 WD Normal 181500
2 3 60 RL 68.0 11250 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 9 2008 WD Normal 223500
3 4 70 RL 60.0 9550 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 2 2006 WD Abnorml 140000
4 5 60 RL 84.0 14260 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 12 2008 WD Normal 250000

5 rows × 81 columns

In [3]:
# Show first five row for test dataset
test.head()
Out[3]:
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition
0 1461 20 RH 80.0 11622 Pave NaN Reg Lvl AllPub ... 120 0 NaN MnPrv NaN 0 6 2010 WD Normal
1 1462 20 RL 81.0 14267 Pave NaN IR1 Lvl AllPub ... 0 0 NaN NaN Gar2 12500 6 2010 WD Normal
2 1463 60 RL 74.0 13830 Pave NaN IR1 Lvl AllPub ... 0 0 NaN MnPrv NaN 0 3 2010 WD Normal
3 1464 60 RL 78.0 9978 Pave NaN IR1 Lvl AllPub ... 0 0 NaN NaN NaN 0 6 2010 WD Normal
4 1465 120 RL 43.0 5005 Pave NaN IR1 HLS AllPub ... 144 0 NaN NaN NaN 0 1 2010 WD Normal

5 rows × 80 columns

Note : Difference between train and test dataset, train has one more column. Because this dataset normally create a model for SalePrice. I use Training data for my analysis.

Training dataset: You present your data from your "gold standard" and train your model, by pairing the input with expected output.

Test dataset: In order to estimate how well your model has been trained (that is dependent upon the size of your data, the value you would like to predict, input etc) and to estimate model properties (mean error for numeric predictors, classification errors for classifiers, recall and precision for IR-models etc.)

Description of Data

In [4]:
# Show rows and columns number
train.shape
Out[4]:
(1460, 81)
In [5]:
# Show Column names
train.columns
Out[5]:
Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF',
       'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'PoolQC',
       'Fence', 'MiscFeature', 'MiscVal', 'MoSold', 'YrSold', 'SaleType',
       'SaleCondition', 'SalePrice'],
      dtype='object')

Note: .shape and .columns are attributes, not methods, so you don't need to follow these with parentheses ().

In [6]:
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1201 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-null object
Exterior2nd      1460 non-null object
MasVnrType       1452 non-null object
MasVnrArea       1452 non-null float64
ExterQual        1460 non-null object
ExterCond        1460 non-null object
Foundation       1460 non-null object
BsmtQual         1423 non-null object
BsmtCond         1423 non-null object
BsmtExposure     1422 non-null object
BsmtFinType1     1423 non-null object
BsmtFinSF1       1460 non-null int64
BsmtFinType2     1422 non-null object
BsmtFinSF2       1460 non-null int64
BsmtUnfSF        1460 non-null int64
TotalBsmtSF      1460 non-null int64
Heating          1460 non-null object
HeatingQC        1460 non-null object
CentralAir       1460 non-null object
Electrical       1459 non-null object
1stFlrSF         1460 non-null int64
2ndFlrSF         1460 non-null int64
LowQualFinSF     1460 non-null int64
GrLivArea        1460 non-null int64
BsmtFullBath     1460 non-null int64
BsmtHalfBath     1460 non-null int64
FullBath         1460 non-null int64
HalfBath         1460 non-null int64
BedroomAbvGr     1460 non-null int64
KitchenAbvGr     1460 non-null int64
KitchenQual      1460 non-null object
TotRmsAbvGrd     1460 non-null int64
Functional       1460 non-null object
Fireplaces       1460 non-null int64
FireplaceQu      770 non-null object
GarageType       1379 non-null object
GarageYrBlt      1379 non-null float64
GarageFinish     1379 non-null object
GarageCars       1460 non-null int64
GarageArea       1460 non-null int64
GarageQual       1379 non-null object
GarageCond       1379 non-null object
PavedDrive       1460 non-null object
WoodDeckSF       1460 non-null int64
OpenPorchSF      1460 non-null int64
EnclosedPorch    1460 non-null int64
3SsnPorch        1460 non-null int64
ScreenPorch      1460 non-null int64
PoolArea         1460 non-null int64
PoolQC           7 non-null object
Fence            281 non-null object
MiscFeature      54 non-null object
MiscVal          1460 non-null int64
MoSold           1460 non-null int64
YrSold           1460 non-null int64
SaleType         1460 non-null object
SaleCondition    1460 non-null object
SalePrice        1460 non-null int64
dtypes: float64(3), int64(35), object(43)
memory usage: 924.0+ KB

You can find detailed description here.

Missing Data

  • I am looking which columns have how many NULL values and their percentage.
In [7]:
# Show number and percentage of Missing data for each column which has missing data
total = train.isnull().sum().sort_values(ascending=False)
percent = (train.isnull().sum()/train.isnull().count()).sort_values(ascending=False)

missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data[missing_data['Percent']>0]
Out[7]:
Total Percent
PoolQC 1453 0.995205
MiscFeature 1406 0.963014
Alley 1369 0.937671
Fence 1179 0.807534
FireplaceQu 690 0.472603
LotFrontage 259 0.177397
GarageCond 81 0.055479
GarageType 81 0.055479
GarageYrBlt 81 0.055479
GarageFinish 81 0.055479
GarageQual 81 0.055479
BsmtExposure 38 0.026027
BsmtFinType2 38 0.026027
BsmtFinType1 37 0.025342
BsmtCond 37 0.025342
BsmtQual 37 0.025342
MasVnrArea 8 0.005479
MasVnrType 8 0.005479
Electrical 1 0.000685
  • First, I thought delete to columns which has a lot of null columns or delete rows from columns which has few null rows.
In [8]:
# Drop columns (6 columns)
# train.drop(['PoolQC','MiscFeature','Alley','Fence','FireplaceQu','LotFrontage'],axis=1,inplace=True)

# Show rows and columns number after drop columns
# train.shape

# We learn some of null values type
# print(train['MasVnrArea'].dtype)
# print(train['MasVnrType'].dtype)

# We fill null MasVnrArea variable with mean of MasVnrArea column
# MasVnrArea_mean = train.MasVnrArea.mean()
# train['MasVnrArea'] = train.MasVnrArea.fillna(MasVnrArea_mean)

# train['MasVnrType'].value_counts(dropna=False)

# We set null values to 'None' which is in the most rows in 
# 'MasVnrType' column
# train['MasVnrType'] = train.MasVnrType.fillna('None')

# train['MasVnrType'].value_counts(dropna=False)

# train.dropna(inplace=True)

# Show rows and columns number after dropping Null columns
# train.shape
  • But after than I thought searching about missing values is part of descriptive analysis.
  • So i am looking columns descriptions and their inside.
  • PoolQC: Pool quality. And null meaning is "No Pool".

  • MiscFeature: Miscellaneous feature not covered in other categories. And null meaning is "None".

  • Alley: Type of alley access to property. And null meaning is "No alley access".

  • Fence: Fence quality. And null meaning is "No Fence".

  • FireplaceQu: Fireplace quality. And null meaning is "No Fireplace".

  • LotFrontage: Linear feet of street connected to property. There is no evidence in data description about null values. But also there is LotArea column, maybe we cand find about null LotFrontage meanings thanks to LotArea columns.

  • GarageCond: Garage condition. GarageType: Garage location. GarageYrBlt: Year garage was built. GarageFinish: Interior finish of the garage. GarageQual: Garage quality.

Null values meaning, columns about garage is "No Garage". But I can't write inside of numerical data(GarageYrBlt) "No Garage" so it will stay null.

  • BsmtExposure: Refers to walkout or garden level walls. BsmtFinType1: Rating of basement finished area. BsmtFinType2: Rating of basement finished area (if multiple types). BsmtCond: Evaluates the general condition of the basement. BsmtQual: Evaluates the height of the basement.

Null values meaning, columns about basement is "No Basement".

  • MasVnrArea: Masonry veneer area in square feet. MasVnrType: Masonry veneer type.

There is no information about null values about Masonry veneer columns.

  • Electrical: Electrical system. There is no information about null value.
In [9]:
train['PoolQC'] = train.PoolQC.fillna("No Pool")
train['MiscFeature'] = train.MiscFeature.fillna("None")
train['Alley'] = train.Alley.fillna("No alley access")
train['Fence'] = train.Fence.fillna("No Fence")
train['FireplaceQu'] = train.FireplaceQu.fillna("No FirePlace")
train['GarageCond'] = train.GarageCond.fillna("No Garage")
train['GarageType'] = train.GarageType.fillna("No Garage")
train['GarageFinish'] = train.GarageFinish.fillna("No Garage")
train['GarageQual'] = train.GarageQual.fillna("No Garage")
train['BsmtExposure'] = train.BsmtExposure.fillna("No Basement")
train['BsmtFinType1'] = train.BsmtFinType1.fillna("No Basement")
train['BsmtFinType2'] = train.BsmtFinType2.fillna("No Basement")
train['BsmtCond'] = train.BsmtCond.fillna("No Basement")
train['BsmtQual'] = train.BsmtQual.fillna("No Basement")

Searching about LotFrontage

In [10]:
train[['LotFrontage','LotArea']].loc[pd.isnull(train['LotFrontage']) == True].head(12)
Out[10]:
LotFrontage LotArea
7 NaN 10382
12 NaN 12968
14 NaN 10920
16 NaN 11241
24 NaN 8246
31 NaN 8544
42 NaN 9180
43 NaN 9200
50 NaN 13869
64 NaN 9375
66 NaN 19900
76 NaN 8475
  • There is no relationship between LotFrontAge NULLs and Lot Area.
  • But i saw in one kernels made in R. They grouped LotFrontAge based on Neighborhood and fill null values their median based on neighborhood. You can find about here
  • I try to do same in my analysis. But i did not do that in python.
In [11]:
# train.groupby('Neighborhood')['LotFrontage'].median()

# neighbor = train[pd.isnull(train['LotFrontage']) == False].groupby('Neighborhood')['LotFrontage'].median()

# Change list to dataframe
# df1 = pd.DataFrame(data=neighbor.index, columns=['Neighborhood_1'])
# df2 = pd.DataFrame(data=neighbor.values, columns=['LotFrontage_1'])
# df = pd.merge(df1, df2, left_index=True, right_index=True)
  • So i find a package in python like R dplyr. So i used it. You can find about dfply in here
In [12]:
#!pip install dfply
from dfply import *

train_2 = train[pd.isnull(train['LotFrontage']) == False]

Neighborhood = (train_2 >> 
    group_by(X.Neighborhood) >> 
    summarize(LotFrontage = X.LotFrontage.median()))

train_3 = (train >> inner_join(Neighborhood, by = 'Neighborhood'))
   
train_3
train_4 = train_3[pd.isnull(train_3['LotFrontage_x']) == True]
train_4
train_5 = (train_4 >> rename(LotFrontage = X.LotFrontage_y) >> select (X.Id, X.LotFrontage))
train_5
train_6 = (train_2 >> select (X.Id, X.LotFrontage))
train_2
train_7 = train_5 >> union(train_6)
train_8 = (train >> select(~X.LotFrontage) >> inner_join(train_7, by = 'Id'))
train = train_8
train[pd.isnull(train['LotFrontage']) == True]
Out[12]:
Id MSSubClass MSZoning LotArea Street Alley LotShape LandContour Utilities LotConfig ... PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice LotFrontage

0 rows × 81 columns

Searching about MasVnrArea and MasVnrType

In [13]:
train.loc[pd.isnull(train['MasVnrArea']) == True].head(12)
Out[13]:
Id MSSubClass MSZoning LotArea Street Alley LotShape LandContour Utilities LotConfig ... PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice LotFrontage
234 235 60 RL 7851 Pave No alley access Reg Lvl AllPub Inside ... No Pool No Fence None 0 5 2010 WD Normal 216500 65.0
529 530 20 RL 32668 Pave No alley access IR1 Lvl AllPub CulDSac ... No Pool No Fence None 0 3 2007 WD Alloca 200624 74.0
650 651 60 FV 8125 Pave No alley access Reg Lvl AllPub Inside ... No Pool No Fence None 0 5 2008 WD Normal 205950 65.0
936 937 20 RL 10083 Pave No alley access Reg Lvl AllPub Inside ... No Pool No Fence None 0 8 2009 WD Normal 184900 67.0
973 974 20 FV 11639 Pave No alley access Reg Lvl AllPub Corner ... No Pool No Fence None 0 12 2008 New Partial 182000 95.0
977 978 120 FV 4274 Pave Pave IR1 Lvl AllPub Inside ... No Pool No Fence None 0 11 2007 New Partial 199900 35.0
1243 1244 20 RL 13891 Pave No alley access Reg Lvl AllPub Inside ... No Pool No Fence None 0 9 2006 New Partial 465000 107.0
1278 1279 60 RL 9473 Pave No alley access Reg Lvl AllPub Inside ... No Pool No Fence None 0 3 2008 WD Normal 237000 75.0

8 rows × 81 columns

Looking inside of columns

1 - Numerical Column

  • I am looking how numerical columns distributed.
In [14]:
numerical = [f for f in train.columns if train.dtypes[f] != 'object']
numerical.remove('SalePrice')
numerical.remove('Id')
train[numerical].describe()
Out[14]:
MSSubClass LotArea OverallQual OverallCond YearBuilt YearRemodAdd MasVnrArea BsmtFinSF1 BsmtFinSF2 BsmtUnfSF ... WoodDeckSF OpenPorchSF EnclosedPorch 3SsnPorch ScreenPorch PoolArea MiscVal MoSold YrSold LotFrontage
count 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1452.000000 1460.000000 1460.000000 1460.000000 ... 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000
mean 56.897260 10516.828082 6.099315 5.575342 1971.267808 1984.865753 103.685262 443.639726 46.549315 567.240411 ... 94.244521 46.660274 21.954110 3.409589 15.060959 2.758904 43.489041 6.321918 2007.815753 70.199658
std 42.300571 9981.264932 1.382997 1.112799 30.202904 20.645407 181.066207 456.098091 161.319273 441.866955 ... 125.338794 66.256028 61.119149 29.317331 55.757415 40.177307 496.123024 2.703626 1.328095 22.431902
min 20.000000 1300.000000 1.000000 1.000000 1872.000000 1950.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 2006.000000 21.000000
25% 20.000000 7553.500000 5.000000 5.000000 1954.000000 1967.000000 0.000000 0.000000 0.000000 223.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 5.000000 2007.000000 60.000000
50% 50.000000 9478.500000 6.000000 5.000000 1973.000000 1994.000000 0.000000 383.500000 0.000000 477.500000 ... 0.000000 25.000000 0.000000 0.000000 0.000000 0.000000 0.000000 6.000000 2008.000000 70.000000
75% 70.000000 11601.500000 7.000000 6.000000 2000.000000 2004.000000 166.000000 712.250000 0.000000 808.000000 ... 168.000000 68.000000 0.000000 0.000000 0.000000 0.000000 0.000000 8.000000 2009.000000 80.000000
max 190.000000 215245.000000 10.000000 9.000000 2010.000000 2010.000000 1600.000000 5644.000000 1474.000000 2336.000000 ... 857.000000 547.000000 552.000000 508.000000 480.000000 738.000000 15500.000000 12.000000 2010.000000 313.000000

8 rows × 36 columns

2 - Categorical Column

  • I am looking at variable count in every categorical column.
  • I am also looking how they distribute according to SalePrice.
In [16]:
from IPython.display import display, HTML

# Assuming that dataframes df1 and df2 are already defined:
#print "Dataframe 1:"
#display(df1)
#print "Dataframe 2:"
#HTML(df2.to_html())

categorical = [f for f in train.columns if train.dtypes[f] == 'object']
for col in categorical:
    count = train.groupby(col)['SalePrice'].count()
    mean = round(train.groupby(col)['SalePrice'].mean()/1000,2)
    min_ = round(train.groupby(col)['SalePrice'].min()/1000,2)
    IQR1 = round(train.groupby(col)['SalePrice'].quantile(.25)/1000,2)
    median = round(train.groupby(col)['SalePrice'].median()/1000,2)
    IQR3 = round(train.groupby(col)['SalePrice'].quantile(.75)/1000,2)
    max_ = round(train.groupby(col)['SalePrice'].max()/1000,2)
    new_df = pd.concat([count, mean, min_, IQR1, median, IQR3, max_], axis=1)
    new_df.columns = ['count', 'mean(1000k)', 'min(1000k)', 'IQR25(1000k)', 'median(1000k)', 'IQR75(1000k)', 'max(1000k)']
    display(new_df.sort_values(by=['count'],ascending=False))
count mean(1000k) min(1000k) IQR25(1000k) median(1000k) IQR75(1000k) max(1000k)
MSZoning
RL 1151 191.00 39.30 137.70 174.00 224.70 755.00
RM 218 126.32 37.90 100.00 120.50 140.00 475.00
FV 65 214.01 144.15 174.00 205.95 250.00 370.88
RH 16 131.56 76.00 106.15 136.50 148.61 200.00
C (all) 10 74.53 34.90 44.00 74.70 98.33 133.90
count mean(1000k) min(1000k) IQR25(1000k) median(1000k) IQR75(1000k) max(1000k)
Street
Pave 1454 181.13 34.90 130.00 163.00 214.00 755.00
Grvl 6 130.19 55.99 88.25 114.25 169.65 228.95
count mean(1000k) min(1000k) IQR25(1000k) median(1000k) IQR75(1000k) max(1000k)
Alley
No alley access 1369 183.45 34.9 130.50 165.0 217.50 755.00
Grvl 50 122.22 52.5 105.31 119.5 136.75 256.00
Pave 41 168.00 40.0 151.00 172.5 185.00 265.98
count mean(1000k) min(1000k) IQR25(1000k) median(1000k) IQR75(1000k) max(1000k)
LotShape
Reg 925 164.75 34.9 120.00 146.00 188.0 582.93
IR1 484 206.10 52.0 150.00 189.00 239.0 755.00
IR2 41 239.83 110.0 175.00 221.00 250.0 538.00
IR3 10 216.04 73.0 167.88 203.57 265.0 375.00
count mean(1000k) min(1000k) IQR25(1000k) median(1000k) IQR75(1000k) max(1000k)
LandContour
Lvl 1311 180.18 34.9 130.00 162.90 212.00 755.0
Bnk 63 143.10 52.5 113.00 139.40 171.25 315.0
HLS 50 231.53 82.5 151.75 222.25 281.35 538.0
Low 36 203.66 39.3 143.00 190.00 263.75 385.0
count mean(1000k) min(1000k) IQR25(1000k) median(1000k) IQR75(1000k) max(1000k)
Utilities
AllPub 1459 180.95 34.9 129.95 163.0 214.0 755.0
NoSeWa 1 137.50 137.5 137.50 137.5 137.5 137.5
count mean(1000k) min(1000k) IQR25(1000k) median(1000k) IQR75(1000k) max(1000k)
LotConfig
Inside 1052 176.94 34.9 128.00 159.70 207.12 611.66
Corner 263 181.62 52.5 129.25 160.00 220.00 755.00
CulDSac 94 223.85 84.0 156.48 199.26 269.34 625.00
FR2 47 177.93 81.0 143.50 165.00 194.50 394.62
FR3 4 208.48 128.0 169.92 195.45 234.00 315.00
count mean(1000k) min(1000k) IQR25(1000k) median(1000k) IQR75(1000k) max(1000k)
LandSlope
Gtl 1382 179.96 34.9 129.9 161.88 212.0 755.0
Mod 65 196.73 39.3 130.0 186.70 259.5 538.0
Sev 13 204.38 61.0 143.0 185.00 260.0 375.0
count mean(1000k) min(1000k) IQR25(1000k) median(1000k) IQR75(1000k) max(1000k)
Neighborhood
NAmes 225 145.85 87.50 127.50 140.00 158.00 345.00
CollgCr 150 197.97 110.00 152.96 197.20 225.72 424.87
OldTown 113 128.23 37.90 105.90 119.00 140.00 475.00
Edwards 100 128.22 58.50 101.50 121.75 145.22 320.00
Somerst 86 225.38 144.15 177.98 225.50 252.92 423.00
Gilbert 79 192.85 141.00 174.00 181.00 197.20 377.50
NridgHt 77 316.27 154.00 253.29 315.00 374.00 611.66
Sawyer 74 136.79 62.38 127.25 135.00 149.46 190.00
NWAmes 73 189.05 82.50 165.15 182.90 205.00 299.80
SawyerW 59 186.56 76.00 145.50 179.90 222.50 320.00
BrkSide 58 124.83 39.30 100.50 124.30 141.18 223.50
Crawfor 51 210.62 90.35 159.25 200.62 239.00 392.50
Mitchel 49 156.27 84.50 131.00 153.50 171.00 271.00
NoRidge 41 335.30 190.00 265.00 301.50 341.00 755.00
Timber 38 242.25 137.50 186.90 228.48 286.12 378.50
IDOTRR 37 100.12 34.90 81.00 103.00 120.50 169.50
ClearCr 28 212.57 130.00 183.75 200.25 242.22 328.00
StoneBr 25 310.50 170.00 213.50 278.00 377.43 556.58
SWISU 25 142.59 60.00 128.00 139.50 160.00 200.00
Blmngtn 17 194.87 159.90 174.00 191.00 213.49 264.56
MeadowV 17 98.58 75.00 83.50 88.00 115.00 151.40
BrDale 16 104.49 83.00 91.00 106.00 118.00 125.00
Veenker 11 238.77 162.50 184.25 218.00 282.00 385.00
NPkVill 9 142.69 127.50 140.00 146.00 148.50 155.00
Blueste 2 137.50 124.00 130.75 137.50 144.25 151.00
count mean(1000k) min(1000k) IQR25(1000k) median(1000k) IQR75(1000k) max(1000k)
Condition1
Norm 1260 184.50 34.9 131.50 166.50 219.50 755.0
Feedr 81 142.48 40.0 121.60 140.00 167.50 244.6
Artery 48 135.09 66.5 105.00 119.55 143.00 475.0
RRAn 26 184.40 79.5 152.39 171.50 190.10 423.0
PosN 19 215.18 109.5 167.25 200.00 252.25 385.0
RRAe 11 138.40 87.0 127.75 142.50 156.50 171.0
PosA 8 225.88 180.0 188.75 212.50 244.00 335.0
RRNn 5 212.40 110.0 128.00 214.00 290.00 320.0
RRNe 2 190.75 187.0 188.88 190.75 192.62 194.5
count mean(1000k) min(1000k) IQR25(1000k) median(1000k) IQR75(1000k) max(1000k)
Condition2
Norm 1445 181.17 34.90 130.00 163.50 214.00 755.0
Feedr 6 121.17 79.50 95.50 127.50 137.00 167.5
Artery 2 106.50 95.00 100.75 106.50 112.25 118.0
PosN 2 284.88 184.75 234.81 284.88 334.94 385.0
RRNn 2 96.75 68.50 82.62 96.75 110.88 125.0
PosA 1 325.00 325.00 325.00 325.00 325.00 325.0
RRAe 1 190.00 190.00 190.00 190.00 190.00 190.0
RRAn 1 136.90 136.90 136.90 136.90 136.90 136.9
count mean(1000k) min(1000k) IQR25(1000k) median(1000k) IQR75(1000k) max(1000k)
BldgType
1Fam 1220 185.76 34.9 131.48 167.90 222.00 755.00
TwnhsE 114 181.96 75.5 143.19 172.20 207.38 392.50
Duplex 52 133.54 82.0 118.38 135.98 145.00 206.30
Twnhs 43 135.91 75.0 95.75 137.50 168.75 230.00
2fmCon 31 128.43 55.0 106.88 127.50 142.50 228.95
count mean(1000k) min(1000k) IQR25(1000k) median(1000k) IQR75(1000k) max(1000k)
HouseStyle
1Story 726 175.99 34.9 127.00 154.75 209.35 611.66
2Story 445 210.05 40.0 159.50 190.00 240.00 755.00
1.5Fin 154 143.12 37.9 114.62 132.00 159.33 410.00
SLvl 65 166.70 91.0 145.00 164.50 178.00 345.00
SFoyer 37 135.07 75.5 127.50 135.96 148.00 206.30
1.5Unf 14 110.15 76.0 98.18 111.25 120.50 139.40
2.5Unf 11 157.35 101.0 125.00 133.90 163.50 325.00
2.5Fin 8 220.00 104.0 164.25 194.00 223.75 475.00
count mean(1000k) min(1000k) IQR25(1000k) median(1000k) IQR75(1000k) max(1000k)
RoofStyle
Gable 1141 171.48 34.9 128.00 160.0 202.50 755.00
Hip 286 218.88 55.0 139.68 176.5 277.88 745.00
Flat 13 194.69 82.0 143.00 185.0 242.00 274.97
Gambrel 11 148.91 40.0 105.50 139.0 193.75 259.50
Mansard 7 180.57 100.0 145.50 175.0 216.00 265.98
Shed 2 225.00 190.0 207.50 225.0 242.50 260.00
count mean(1000k) min(1000k) IQR25(1000k) median(1000k) IQR75(1000k) max(1000k)
RoofMatl
CompShg 1434 179.80 34.9 129.9 162.0 213.0 745.00
Tar&Grv 11 185.41 82.0 136.0 167.0 249.0 274.97
WdShngl 6 390.25 168.5 278.5 332.5 452.5 755.00
WdShake 5 241.40 190.0 228.0 242.0 260.0 287.00
ClyTile 1 160.00 160.0 160.0 160.0 160.0 160.00
Membran 1 241.50 241.5 241.5 241.5 241.5 241.50
Metal 1 180.00 180.0 180.0 180.0 180.0 180.00
Roll 1 137.00 137.0 137.0 137.0 137.0 137.00
count mean(1000k) min(1000k) IQR25(1000k) median(1000k) IQR75(1000k) max(1000k)
Exterior1st
VinylSd 515 213.73 37.90 160.00 200.00 248.61 611.66
HdBoard 222 163.08 83.00 129.62 149.90 179.90 755.00
MetalSd 220 149.42 62.38 117.75 139.00 164.78 392.00
Wd Sdng 206 149.84 34.90 112.62 138.94 167.25 745.00
Plywood 108 175.94 82.50 143.44 167.45 197.50 345.00
CemntBd 61 231.69 75.00 119.50 236.50 303.48 556.58
BrkFace 50 194.57 40.00 134.52 165.75 245.12 430.00
WdShing 26 150.66 87.00 109.75 128.70 160.25 385.00
Stucco 25 162.99 58.50 116.00 144.00 188.70 381.00
AsbShng 20 107.39 35.31 85.75 108.00 133.50 165.50
BrkComm 2 71.00 60.00 65.50 71.00 76.50 82.00
Stone 2 258.50 230.00 244.25 258.50 272.75 287.00
AsphShn 1 100.00 100.00 100.00 100.00 100.00 100.00
CBlock 1 105.00 105.00 105.00 105.00 105.00 105.00
ImStucc 1 262.00 262.00 262.00 262.00 262.00 262.00
count mean(1000k) min(1000k) IQR25(1000k) median(1000k) IQR75(1000k) max(1000k)
Exterior2nd
VinylSd 504 214.43 37.90 161.55 200.07 249.10 611.66
MetalSd 214 149.80 62.38 118.62 138.75 164.52 392.00
HdBoard 207 167.66 83.00 130.50 155.00 182.50 755.00
Wd Sdng 197 148.39 34.90 114.50 138.00 164.90 475.00
Plywood 142 168.11 76.50 135.00 160.75 192.75 320.00
CmentBd 60 230.09 75.00 118.38 238.75 304.08 556.58
Wd Shng 38 161.33 55.00 117.00 138.22 172.31 538.00
Stucco 26 155.91 58.50 92.26 142.00 197.25 381.00
BrkFace 25 195.82 40.00 137.00 160.00 250.00 430.00
AsbShng 20 114.06 35.31 94.00 111.00 135.50 225.00
ImStucc 10 252.07 88.00 131.25 187.60 305.50 745.00
Brk Cmn 7 126.71 60.00 114.00 147.00 148.50 155.00
Stone 5 158.22 90.00 93.50 177.00 200.62 230.00
AsphShn 3 138.00 100.00 119.50 139.00 157.00 175.00
CBlock 1 105.00 105.00 105.00 105.00 105.00 105.00
Other 1 319.00 319.00 319.00 319.00 319.00 319.00
count mean(1000k) min(1000k) IQR25(1000k) median(1000k) IQR75(1000k) max(1000k)
MasVnrType
None 864 156.22 34.90 118.99 143.00 181.88 745.00
BrkFace 445 204.69 75.00 149.30 181.00 236.00 755.00
Stone 128 265.58 119.00 194.65 246.84 312.78 611.66
BrkCmn 15 146.32 89.47 114.25 139.00 163.95 277.00
count mean(1000k) min(1000k) IQR25(1000k) median(1000k) IQR75(1000k) max(1000k)
ExterQual
TA 906 144.34 34.9 118.59 139.45 165.50 381.0
Gd 488 231.63 52.0 185.00 220.00 265.98 745.0
Ex 52 367.36 160.0 311.40 364.61 428.79 755.0
Fa 14 87.99 39.3 60.25 82.25 102.00 200.0
count mean(1000k) min(1000k) IQR25(1000k) median(1000k) IQR75(1000k) max(1000k)
ExterCond
TA 1282 184.03 34.9 131.10 167.37 217.33 755.0
Gd 146 168.90 68.4 128.62 151.25 187.38 625.0
Fa 28 102.60 39.3 65.50 95.75 137.75 169.5
Ex 3 201.33 118.0 139.50 161.00 243.00 325.0
Po 1 76.50 76.5 76.50 76.50 76.50 76.5
count mean(1000k) min(1000k) IQR25(1000k) median(1000k) IQR75(1000k) max(1000k)
Foundation
PConc 647 225.23 78.00 173.00 205.00 262.14 755.00
CBlock 634 149.81 34.90 122.12 141.50 168.00 402.86
BrkTil 146 132.29 37.90 102.40 125.25 151.25 475.00
Slab 24 107.37 39.30 89.38 104.15 118.88 198.50
Stone 6 165.96 102.78 109.62 126.50 233.73 266.50
Wood 3 185.67 143.00 153.50 164.00 207.00 250.00
count mean(1000k) min(1000k) IQR25(1000k) median(1000k) IQR75(1000k) max(1000k)
BsmtQual
TA 649 140.76 34.9 117.0 135.50 158.00 475.0
Gd 618 202.69 75.5 165.1 192.07 234.00 538.0
Ex 121 327.04 123.5 255.5 318.00 378.50 755.0
No Basement 37 105.65 39.3 90.0 101.80 118.86 198.5
Fa 35 115.69 61.0 91.4 112.00 133.50 206.9
count mean(1000k) min(1000k) IQR25(1000k) median(1000k) IQR75(1000k) max(1000k)
BsmtCond
TA 1311 183.63 34.9 132.5 165.00 215.00 755.00
Gd 65 213.60 119.0 155.0 193.88 264.56 465.00
Fa 45 121.81 55.0 91.5 118.50 137.90 265.98
No Basement 37 105.65 39.3 90.0 101.80 118.86 198.50
Po 2 64.00 61.0 62.5 64.00 65.50 67.00
count mean(1000k) min(1000k) IQR25(1000k) median(1000k) IQR75(1000k) max(1000k)
BsmtExposure
No 953 165.65 34.90 126.00 154.00 191.00 501.84
Av 221 206.64 35.31 145.00 185.85 248.90 745.00
Gd 134 257.69 61.00 171.98 226.98 316.32 755.00
Mn 114 192.79 78.00 138.82 182.45 228.09 475.00
No Basement 38 107.94 39.30 90.00 104.02 118.94 198.50
count mean(1000k) min(1000k) IQR25(1000k) median(1000k) IQR75(1000k) max(1000k)
BsmtFinType1
Unf 430 170.67 37.90 118.10 161.75 202.62 582.93
GLQ 418 235.41 80.00 173.12 213.75 277.38 755.00
ALQ 220 161.57 81.00 129.00 149.25 179.92 538.00
BLQ 148 149.49 35.31 122.75 139.10 163.10 359.10
Rec 133 146.89 34.90 123.00 142.00 161.50 381.00
LwQ 74 151.85 75.50 119.00 139.00 179.12 301.00
No Basement 37 105.65 39.30 90.00 101.80 118.86 198.50
count mean(1000k) min(1000k) IQR25(1000k) median(1000k) IQR75(1000k) max(1000k)
BsmtFinType2
Unf 1256 184.69 34.90 132.00 167.00 219.50 755.0
Rec 54 164.92 85.00 129.00 148.75 180.18 402.0
LwQ 46 164.36 88.00 134.50 154.00 179.75 287.0
No Basement 38 110.35 39.30 90.00 104.02 118.94 284.0
BLQ 33 151.10 62.38 129.00 143.00 172.50 271.9
ALQ 19 209.94 123.50 139.00 174.90 262.50 555.0
GLQ 14 180.98 75.50 130.25 203.12 231.50 270.0
count mean(1000k) min(1000k) IQR25(1000k) median(1000k) IQR75(1000k) max(1000k)
Heating
GasA 1428 182.02 34.9 130.50 164.50 214.12 755.0
GasW 18 166.63 82.0 116.38 134.95 220.00 375.0
Grav 7 75.27 37.9 56.25 79.00 88.25 121.0
Wall 4 92.10 87.5 89.38 91.45 94.18 98.0
OthW 2 125.75 122.0 123.88 125.75 127.62 129.5
Floor 1 72.50 72.5 72.50 72.50 72.50 72.5
count mean(1000k) min(1000k) IQR25(1000k) median(1000k) IQR75(1000k) max(1000k)
HeatingQC
Ex 741 214.91 66.5 154.9 194.7 252.00 755.0
TA 428 142.36 34.9 115.0 135.0 158.93 375.0
Gd 241 156.86 52.0 122.5 152.0 179.20 395.0
Fa 49 123.92 37.9 87.0 123.5 161.00 235.0
Po 1 87.00 87.0 87.0 87.0 87.00 87.0
count mean(1000k) min(1000k) IQR25(1000k) median(1000k) IQR75(1000k) max(1000k)
CentralAir
Y 1365 186.19 52.0 134.8 168.0 219.21 755.00
N 95 105.26 34.9 82.0 98.0 128.50 265.98
count mean(1000k) min(1000k) IQR25(1000k) median(1000k) IQR75(1000k) max(1000k)
Electrical
SBrkr 1334 186.83 37.9 134.57 170.00 220.75 755.0
FuseA 94 122.20 34.9 98.50 121.25 143.53 239.0
FuseF 27 107.68 39.3 88.50 115.00 129.95 169.5
FuseP 3 97.33 73.0 77.50 82.00 109.50 137.0
Mix 1 67.00 67.0 67.00 67.00 67.00 67.0
count mean(1000k) min(1000k) IQR25(1000k) median(1000k) IQR75(1000k) max(1000k)
KitchenQual
TA 735 139.96 34.9 116.02 137.00 160.00 375.0
Gd 586 212.12 79.0 172.50 201.40 242.75 625.0
Ex 100 328.55 86.0 263.47 316.75 385.31 755.0
Fa 39 105.57 39.3 81.00 115.00 131.50 200.0
count mean(1000k) min(1000k) IQR25(1000k) median(1000k) IQR75(1000k) max(1000k)
Functional
Typ 1360 183.43 34.9 131.30 165.50 217.12 755.0
Min2 34 144.24 87.0 124.25 140.00 164.24 316.6
Min1 31 146.39 82.0 123.75 139.00 167.45 256.0
Mod 15 168.39 55.0 102.75 137.90 195.00 538.0
Maj1 14 153.95 60.0 130.00 140.75 189.12 315.0
Maj2 5 85.80 52.0 67.00 85.00 110.00 115.0
Sev 1 129.00 129.0 129.00 129.00 129.00 129.0
count mean(1000k) min(1000k) IQR25(1000k) median(1000k) IQR75(1000k) max(1000k)
FireplaceQu
No FirePlace 690 141.33 34.90 112.00 135.00 164.38 342.64
Gd 380 226.35 90.35 159.67 206.95 274.98 611.66
TA 313 205.72 82.50 165.00 187.50 236.00 745.00
Fa 33 167.30 117.00 143.00 158.00 187.00 262.00
Ex 24 337.71 130.50 265.07 314.25 411.32 755.00
Po 20 129.76 60.00 120.92 131.50 149.50 172.00
count mean(1000k) min(1000k) IQR25(1000k) median(1000k) IQR75(1000k) max(1000k)
GarageType
Attchd 870 202.89 60.00 150.00 185.0 235.00 755.00
Detchd 387 134.09 35.31 110.00 129.5 147.70 475.00
BuiltIn 88 254.75 91.00 180.75 227.5 300.72 582.93
No Garage 81 103.32 34.90 82.50 100.0 124.00 200.50
Basment 19 160.57 55.99 132.25 148.0 174.00 359.10
CarPort 9 109.96 82.00 95.00 108.0 110.00 164.90
2Types 6 151.28 87.00 146.25 159.0 170.25 186.70
count mean(1000k) min(1000k) IQR25(1000k) median(1000k) IQR75(1000k) max(1000k)
GarageFinish
Unf 605 142.16 35.31 116.0 135.0 157.00 475.00
RFn 422 202.07 68.40 158.0 190.0 232.00 582.93
Fin 352 240.05 82.50 176.0 215.0 283.85 755.00
No Garage 81 103.32 34.90 82.5 100.0 124.00 200.50
count mean(1000k) min(1000k) IQR25(1000k) median(1000k) IQR75(1000k) max(1000k)
GarageQual
TA 1311 187.49 35.31 135.50 170.00 220.00 755.0
No Garage 81 103.32 34.90 82.50 100.00 124.00 200.5
Fa 48 123.57 64.50 99.50 115.00 135.61 256.0
Gd 14 215.86 90.35 165.98 209.12 271.85 325.0
Ex 3 241.00 120.50 124.00 127.50 301.25 475.0
Po 3 100.17 67.00 81.75 96.50 116.75 137.0
count mean(1000k) min(1000k) IQR25(1000k) median(1000k) IQR75(1000k) max(1000k)
GarageCond
TA 1326 187.89 35.31 135.00 170.0 221.00 755.0
No Garage 81 103.32 34.90 82.50 100.0 124.00 200.5
Fa 35 114.65 40.00 89.00 114.5 137.50 220.0
Gd 9 179.93 118.50 144.00 148.0 180.00 302.0
Po 7 108.50 67.00 101.25 108.0 122.50 137.0
Ex 2 124.00 120.50 122.25 124.0 125.75 127.5
count mean(1000k) min(1000k) IQR25(1000k) median(1000k) IQR75(1000k) max(1000k)
PavedDrive
Y 1340 186.43 35.31 134.34 168.50 220.0 755.00
N 90 115.04 34.90 84.62 111.00 137.1 265.98
P 30 132.33 79.50 102.75 132.25 159.1 215.00
count mean(1000k) min(1000k) IQR25(1000k) median(1000k) IQR75(1000k) max(1000k)
PoolQC
No Pool 1453 180.40 34.9 129.90 162.9 213.50 755.00
Gd 3 201.99 160.0 165.50 171.0 222.98 274.97
Ex 2 490.00 235.0 362.50 490.0 617.50 745.00
Fa 2 215.50 181.0 198.25 215.5 232.75 250.00
count mean(1000k) min(1000k) IQR25(1000k) median(1000k) IQR75(1000k) max(1000k)
Fence
No Fence 1179 187.60 35.31 132.75 173.00 224.70 755.0
MnPrv 157 148.75 40.00 119.90 137.45 158.00 745.0
GdPrv 59 178.93 108.00 144.95 167.50 189.48 475.0
GdWo 54 140.38 34.90 118.62 138.75 151.75 381.0
MnWw 11 134.29 110.00 121.50 130.00 143.00 187.0
count mean(1000k) min(1000k) IQR25(1000k) median(1000k) IQR75(1000k) max(1000k)
MiscFeature
None 1406 182.05 34.90 130.00 164.25 215.00 755.0
Shed 49 151.19 55.99 120.00 144.00 172.00 277.0
Gar2 2 170.75 151.50 161.12 170.75 180.38 190.0
Othr 2 94.00 55.00 74.50 94.00 113.50 133.0
TenC 1 250.00 250.00 250.00 250.00 250.00 250.0
count mean(1000k) min(1000k) IQR25(1000k) median(1000k) IQR75(1000k) max(1000k)
SaleType
WD 1267 173.40 34.9 128.93 158.00 202.95 755.00
New 122 274.95 113.0 194.05 247.45 341.92 611.66
COD 43 143.97 60.0 112.00 139.00 160.50 287.00
ConLD 9 138.78 79.9 85.00 140.00 160.00 235.13
ConLI 5 200.39 110.0 115.00 125.00 200.00 451.95
ConLw 5 143.70 60.0 95.00 144.00 207.50 212.00
CWD 4 210.60 136.0 152.50 188.75 246.85 328.90
Oth 3 119.85 93.5 104.78 116.05 133.02 150.00
Con 2 269.60 215.2 242.40 269.60 296.80 324.00
count mean(1000k) min(1000k) IQR25(1000k) median(1000k) IQR75(1000k) max(1000k)
SaleCondition
Normal 1198 175.20 39.30 130.00 160.00 205.00 755.00
Partial 125 272.29 113.00 193.88 244.60 339.75 611.66
Abnorml 101 146.53 34.90 104.00 130.00 172.50 745.00
Family 20 149.60 82.50 115.50 140.50 170.25 259.00
Alloca 12 167.38 55.99 116.38 148.15 202.04 359.10
AdjLand 4 104.12 81.00 81.75 104.00 126.38 127.50

About Sales Price

  • First we look at distribution of sales price(1000k).
In [17]:
sales_price = round(train['SalePrice']/1000,2)
round(sales_price.describe(),2)
Out[17]:
count    1460.00
mean      180.92
std        79.44
min        34.90
25%       129.98
50%       163.00
75%       214.00
max       755.00
Name: SalePrice, dtype: float64
In [18]:
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
In [19]:
sns.distplot(train['SalePrice'], kde = True, color = 'orange', hist_kws={'alpha': 0.5})
Out[19]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a0a59d940>
In [20]:
import scipy.stats as st
y = train['SalePrice']
plt.figure(1); plt.title('Johnson SU')
sns.distplot(y, kde=False, fit=st.johnsonsu)
plt.figure(2); plt.title('Normal')
sns.distplot(y, kde=False, fit=st.norm)
plt.figure(3); plt.title('Log Normal')
sns.distplot(y, kde=False, fit=st.lognorm)
Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a12cde3c8>
  • I am looking at skewness and kurtosis of SalePrice column.
  • you can find more about skewness and kurtosis here skewness and kurtosis here
In [21]:
import matplotlib
import numpy as np
matplotlib.rcParams['figure.figsize'] = (15.0, 9.0)
prices = pd.DataFrame({"price":train["SalePrice"], "log(price + 1)":np.log1p(train["SalePrice"])})
prices.hist()
Out[21]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x1a12ceed30>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x1a1302c470>]], dtype=object)
In [22]:
#skewness and kurtosis
print("Skewness: %f" % train['SalePrice'].skew())
print("Kurtosis: %f" % train['SalePrice'].kurt())
Skewness: 1.882876
Kurtosis: 6.536282

Looking Data with graphs

Other Numerical Data Distribution

  • I am looking how numerical data distribution.
  • Firstly i define numerical columns.
In [23]:
quantitative = [f for f in train.columns if train.dtypes[f] != 'object'] 
quantitative.remove('SalePrice')
quantitative.remove('Id')

f = pd.melt(train, value_vars=quantitative)
g = sns.FacetGrid(f, col="variable",  col_wrap=3, sharex=False, sharey=False)
g = g.map(sns.distplot, "value")
  • From graphs looking we can say few of them has normal distribution.

Reletionship between Numerical Data and SalePrice

  • Firstly, i plot scatter plot with pairplot but i do not like it.
  • Also i defined categorical columns.
In [24]:
qualitative = [f for f in train.columns if train.dtypes[f] == 'object']

def scatter(x, y, **kwargs):
    plt.scatter(x = x, y =  y)
    plt.xticks(rotation=90)

def pairplot(x, y, **kwargs):
    ax = plt.gca()
    ts = pd.DataFrame({'time': x, 'val': y})
    ts = ts.groupby('time').mean()
    ts.plot(ax=ax)
    plt.xticks(rotation=90)


f = pd.melt(train, id_vars=['SalePrice'], value_vars=quantitative)
g = sns.FacetGrid(f, col="variable",  col_wrap=2, sharex=False, sharey=False, size=7)
g = g.map(scatter, "value", "SalePrice")
#g = g.map(pairplot, "value", "SalePrice")
  • When i look graphs some of columns hightly correlated with SalePrice.
  • Also i can say some of t

Reletionship between Categorical Data and SalePrice

  • I am looking relation ships categorical data and SalePrice.
  • I am using boxplot and swarmplot together.
  • boxplot: looking for median, IQR25, IQR75 and outliers.
  • swarmplot: looking how data distributed.
In [25]:
train_2= train
train_2[['YrSold']] = train_2[['YrSold']].astype('object')

qualitative = [f for f in train_2.columns if train_2.dtypes[f] == 'object']


def boxplot(x, y, **kwargs):
    sns.boxplot(x = x, y = y, color = "white", showfliers=False) # no outliers for boxplot
    plt.xticks(rotation=90)

def swarmplot(x, y, **kwargs):    
    sns.swarmplot(x = x, y = y, size=2)
    plt.xticks(rotation=90)
    

    
f = pd.melt(train, id_vars = ['SalePrice'], value_vars = qualitative)
g = sns.FacetGrid(f, col = "variable",  col_wrap = 2, sharex = False, sharey = False, size = 8)
g = g.map(swarmplot, "value", "SalePrice")
g = g.map(boxplot, "value", "SalePrice")

Looking SalePrice Relationship with Other Columns

Correlation with Numerical Data

  • First we look all numerical columns correlations.
In [26]:
#correlation matrix
corrmat = train.corr()
f, ax = plt.subplots(figsize=(20, 9))
sns.heatmap(corrmat, vmax=.8, annot=True);
  • Looking for most correlateds columns with SalePrice.
In [27]:
# most correlated features
corrmat = train.corr()
top_corr_features = corrmat.index[abs(corrmat["SalePrice"])>0.6]
plt.figure(figsize=(10,10))
g = sns.heatmap(train[top_corr_features].corr(),annot=True,cmap="RdYlGn")
  • Most corelated numerical columns with SalePrice are:

  • OverallQual: Rates the overall material and finish of the house.

  • GrLivArea: Above grade (ground) living area square feet.

  • GarageCars: Size of garage in car capacity.

  • GarageArea: Size of garage in square feet

  • TotalBsmtSF: Total square feet of basement area.

  • 1stFlrSF: First Floor square feet.

In [36]:
sns.set()
cols = ['SalePrice', 'OverallQual', 'GrLivArea', 'GarageCars', 'TotalBsmtSF', '1stFlrSF']
sns.pairplot(train[cols], size = 2.5, diag_kind="kde")
plt.show();
  • They are corelated SalePrice but also they are highly correlated each other.
In [35]:
def regplot(x, y, **kwargs):
    sns.regplot(x = x, y =  y)
    plt.xticks(rotation=90)
    
f = pd.melt(train[cols], id_vars=['SalePrice'])# , value_vars=quantitative
g = sns.FacetGrid(f, col="variable",  col_wrap=2, sharex=False, sharey=False, size=7)
g = g.map(regplot, "value", "SalePrice")
  • We can see some outliers from graphs.

Correlation with Categorical Data

  • First we change our categorical data to dummy variable.
  • And concaneta it with SalePrice.
In [31]:
import pandas as pd

dummies = pd.get_dummies(train[qualitative])
SalePrice = train['SalePrice']
frames = [SalePrice, dummies]
dummies_2 = pd.concat(frames, axis=1)
dummies_2.head()
Out[31]:
SalePrice MSZoning_C (all) MSZoning_FV MSZoning_RH MSZoning_RL MSZoning_RM Street_Grvl Street_Pave Alley_Grvl Alley_No alley access ... SaleType_ConLw SaleType_New SaleType_Oth SaleType_WD SaleCondition_Abnorml SaleCondition_AdjLand SaleCondition_Alloca SaleCondition_Family SaleCondition_Normal SaleCondition_Partial
0 208500 0 0 0 1 0 0 1 0 1 ... 0 0 0 1 0 0 0 0 1 0
1 181500 0 0 0 1 0 0 1 0 1 ... 0 0 0 1 0 0 0 0 1 0
2 223500 0 0 0 1 0 0 1 0 1 ... 0 0 0 1 0 0 0 0 1 0
3 140000 0 0 0 1 0 0 1 0 1 ... 0 0 0 1 1 0 0 0 0 0
4 250000 0 0 0 1 0 0 1 0 1 ... 0 0 0 1 0 0 0 0 1 0

5 rows × 272 columns

In [32]:
corrmat = dummies_2.corr()
top_corr_features = corrmat.index[abs(corrmat["SalePrice"])>0.5]
plt.figure(figsize=(10,10))
g = sns.heatmap(dummies_2[top_corr_features].corr(),annot=True,cmap="RdYlGn")
  • Most corelated categorical columns with SalePrice are:

  • ExterQual: Evaluates the quality of the material on the exterior.

    • Ex Excellent
    • Gd Good
    • TA Average/Typical
    • Fa Fair
    • Po Poor
  • BsmtQual: Evaluates the height of the basement

    • Ex Excellent (100+ inches)
    • Gd Good (90-99 inches)
    • TA Typical (80-89 inches)
    • Fa Fair (70-79 inches)
    • Po Poor (<70 inches
    • NA No Basement
  • KitchenQual: Kitchen quality

    • Ex Excellent
    • Gd Good
    • TA Typical/Average
    • Fa Fair
    • Po Poor
  • We are looking dummy variables with regression plot.
In [33]:
def regplot(x, y, **kwargs):
    sns.regplot(x = x, y =  y)
    plt.xticks(rotation=90)
    
f = pd.melt(dummies_2[top_corr_features], id_vars=['SalePrice'])# , value_vars=quantitative
g = sns.FacetGrid(f, col="variable",  col_wrap=2, sharex=False, sharey=False, size=7)
g = g.map(regplot, "value", "SalePrice")
  • We are also looking these categorical variables with boxplot and swarmplot again.
In [34]:
train_2= train
train_2[['YrSold']] = train_2[['YrSold']].astype('object')

qualitative = [f for f in train_2.columns if train_2.dtypes[f] == 'object']


def boxplot(x, y, **kwargs):
    sns.boxplot(x = x, y = y, color = "white", showfliers=False) # no outliers for boxplot
    plt.xticks(rotation=90)

def swarmplot(x, y, **kwargs):    
    sns.swarmplot(x = x, y = y, size=2)
    plt.xticks(rotation=90)
    


f = pd.melt(train, id_vars = ['SalePrice'], value_vars = ['ExterQual','BsmtQual','KitchenQual'])
g = sns.FacetGrid(f, col = "variable",  col_wrap = 2, sharex = False, sharey = False, size = 8)
g = g.map(swarmplot, "value", "SalePrice")
g = g.map(boxplot, "value", "SalePrice")
In [ ]:
- We can say excellent quality kitchen, basement and exterior material effect price noticeable.

Look 3 Columns Relationships Together

  • BsmtQual: Evaluates the height of the basement,
  • TotalBsmtSF: Total square feet of basement area.
  • Both are highly correlated with SalePrice.
  • We can look TotalBsmtSF effect of SalePrice based on BsmtQual with colorful plot.
In [42]:
sns.lmplot( x="TotalBsmtSF", y="SalePrice", data=train, fit_reg=False, hue="BsmtQual", legend=False)
Out[42]:
<seaborn.axisgrid.FacetGrid at 0x1a149e65c0>
  • When we look at this we do not understand so much. Because colors are mixed.
  • We can look also faced graphs.
In [51]:
g = sns.FacetGrid(train, col="BsmtQual",col_wrap = 2,  size = 8)
g = g.map(plt.scatter, "TotalBsmtSF", "SalePrice", edgecolor="w")
In [ ]:
- If Basement Quality in "Gd", "TA" and "Ex" there is correlation between Sale Price and Total Basement Floor.
In [ ]:
- We can add one more column on our scatter plots.
In [52]:
g = sns.FacetGrid(train, col="BsmtQual", hue="ExterQual",col_wrap = 2,  size = 6)
g = g.map(plt.scatter, "TotalBsmtSF", "SalePrice", edgecolor="w").add_legend()
  • We can learn a lot from this graphs.
  • If BsmtQual = "Fa" or No Basement" or "TA" then ExterQual generally "TA" sometimes "Fa" or
  • If BsmtQual = "Ex" than ExterQual generally "Ex" or "Gd"
  • Most expensive house has "Ex" BsmtQual and "Ex" ExterQual and more than average TotalBsmtSF.

Results

  • I learned how to look data different ways.
  • I learn dplyr way analysis in python.
  • I learned how to find code i needed.
  • I learned to adapt the code the way I want.
  • I learned a lot of people perspectives from kernels.
  • I learned look at relationships between more than two columns with graphs.